Collaborators

Include the names of your collaborators here.

Overview

This homework assignment is focused on model complexity and the influence of the prior regularization strength. You will fit non-Bayesian and Bayesian linear models, compare them, and make predictions to visualize the trends. You will use multiple prior strengths to study the impact on the coefficient posteriors and on the posterior predictive distributions.

You are also introduced to non-Bayesian regularization with Lasso regression via the glmnet package. If you do not have glmnet installed, please install it before starting the assignment.

IMPORTANT: code chunks are created for you. Each code chunk has eval=FALSE set in the chunk options. You MUST change it to be eval=TRUE in order for the code chunks to be evaluated when rendering the document.

You are allowed to add as many code chunks as you see fit to answer the questions.

Load packages

This assignment will use packages from the tidyverse suite as well as the coefplot package. Those packages are imported for you below.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(coefplot)

This assignment also uses the splines and MASS packages. Both are installed with base R and so you do not need to download any additional packages to complete the assignment.

The last question in the assignment uses the glmnet package. As stated previously, please download and install glmnet if you do not currently have it.

Problem 01

You will fit and compare 6 models of varying complexity using non-Bayesian methods. The unknown parameters will be estimated by finding their Maximum Likelihood Estimates (MLE). You are allowed to use the lm() function for this problem.

The data are loaded in the code chunk and a glimpse is shown for you below. There are 2 continuous inputs, x1 and x2, and a continuous response y.

data_url <- 'https://raw.githubusercontent.com/jyurko/INFSCI_2595_Spring_2022/main/HW/08/hw08_data.csv'

df <- readr::read_csv(data_url, col_names = TRUE)
## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x1, x2, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df %>% glimpse()
## Rows: 100
## Columns: 3
## $ x1 <dbl> -0.30923281, 0.63127211, -0.68276690, 0.26930562, 0.37252021, 1.296…
## $ x2 <dbl> 0.308779853, -0.547919793, 2.166449412, 1.209703658, 0.785485991, -…
## $ y  <dbl> 0.43636596, 1.37562976, -0.84366730, -0.43080811, 0.77456951, 1.361…

1a)

Create a scatter plot between the response, y, and each input using ggplot().

Based on the visualizations, do you think there are trends between either input and the response?

SOLUTION

df %>% ggplot() +
  geom_point(aes(x = x1, y = y), color = "blue") +
  geom_point(aes(x = x2, y = y), color = "red")
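An alternative sketch, assuming the df object loaded above: reshape the two inputs to long format so each input gets its own panel with a shared y-axis, which makes the per-input trends easier to compare than overlaid colors.

```r
# Sketch: pivot the inputs to long format so each input gets its own facet.
# Assumes df contains the columns x1, x2, and y as loaded above.
df %>% 
  tidyr::pivot_longer(cols = c(x1, x2), names_to = "input", values_to = "value") %>% 
  ggplot(mapping = aes(x = value, y = y)) +
  geom_point() +
  facet_wrap(~input, scales = "free_x")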

1b)

You will fit multiple models of varying complexity in this problem. You will start with linear additive features.

Fit a model with linear additive features to predict the response, y. Use the formula interface and the lm() function to fit the model. Assign the result to the mod01 object.

Visualize the coefficient summaries with the coefplot() function. Are any of the features statistically significant?

SOLUTION

### add more code chunks if you like
mod01 <- lm(y ~ x1:x2, df)
coefplot(mod01)

summary(mod01)
## 
## Call:
## lm(formula = y ~ x1:x2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8104 -0.5355  0.1004  0.7187  1.7749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.01905    0.09929   0.192   0.8483  
## x1:x2       -0.17845    0.09548  -1.869   0.0646 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9876 on 98 degrees of freedom
## Multiple R-squared:  0.03441,    Adjusted R-squared:  0.02456 
## F-statistic: 3.493 on 1 and 98 DF,  p-value: 0.06462

Based on the summary, the lone x1:x2 feature is not statistically significant at the 0.05 level (p ≈ 0.065). Note that the formula y ~ x1:x2 contains only the interaction feature; a model with linear additive features would be y ~ x1 + x2.

1c)

As discussed in lecture, we can derive features from inputs. We have worked with polynomial features and spline-based features in previous assignments. Features can also be derived as the products between different inputs. A feature calculated as the product of multiple inputs is usually referred to as the interaction between those inputs.

In the formula interface, a product of two inputs is denoted by the : operator. So if we wanted to include just the multiplication of x1 and x2 in a model we would type x1:x2. We can then include main-effect terms by including the additive features within the formula. Thus, the formula for a model with additive features and the interaction between x1 and x2 is:

y ~ x1 + x2 + x1:x2

However, the formula interface provides a short-cut to create main effects and interaction features. In the formula interface, the * operator will generate all main-effects and all interactions for us.
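As a quick check of the short-cut, a sketch using a small illustrative data frame: model.matrix() reveals exactly which columns the * operator generates.

```r
# Sketch: the * operator expands to all main effects plus the interaction.
# The toy data frame here is purely illustrative.
toy <- data.frame(x1 = c(1, 2, 3), x2 = c(4, 5, 6))
colnames(model.matrix(~ x1 * x2, toy))
# "(Intercept)" "x1" "x2" "x1:x2"
```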

Fit a model with all main-effect and all-interaction features between x1 and x2 using the short-cut * operator within the formula interface. Assign the result to the mod02 object.

Visualize the coefficient summaries with the coefplot() function. How many features are present in the model? Are any of the features statistically significant?

SOLUTION

### add more code chunks if you like
mod02 <- lm(y ~ x1 + x2 + x1 * x2, df)
coefplot(mod02)

summary(mod02)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2016 -0.6424 -0.0518  0.8017  1.7827 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.02413    0.09616   0.251   0.8024  
## x1           0.23061    0.09769   2.361   0.0203 *
## x2          -0.19364    0.09734  -1.989   0.0495 *
## x1:x2       -0.22607    0.09389  -2.408   0.0180 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9564 on 96 degrees of freedom
## Multiple R-squared:  0.113,  Adjusted R-squared:  0.08529 
## F-statistic: 4.077 on 3 and 96 DF,  p-value: 0.009004

There are 4 features in the model (including the intercept). The x1, x2, and x1:x2 features are all statistically significant at the 0.05 level.

1d)

The * operator will interact more than just inputs. We can interact expressions or groups of features together. To interact one group of features with another, we enclose each group in parentheses, (), and separate them with the * operator. The line of code below shows how this works, with <expression 1> and <expression 2> as placeholders for any expressions we want to use.

(<expression 1>) * (<expression 2>)

Fit a model which interacts linear and quadratic features from x1 with linear and quadratic features from x2. Assign the result to the mod03 object.

Visualize the coefficient summaries with the coefplot() function. How many features are present in the model? Are any of the features statistically significant?

HINT: Remember to use the I() function when typing polynomials in the formula interface.
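To see why the I() wrapper matters, a sketch with an illustrative data frame: inside a formula, ^ denotes crossing of terms rather than arithmetic exponentiation, so an unprotected x1^2 silently collapses back to x1.

```r
# Sketch: in the formula interface, x1^2 without I() is term crossing, not squaring.
# The toy data frame here is purely illustrative.
toy <- data.frame(x1 = c(0.5, 1.5, 2.5))
colnames(model.matrix(~ x1 + I(x1^2), toy))  # includes the quadratic column
colnames(model.matrix(~ x1 + x1^2, toy))     # quadratic column is silently dropped
```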

SOLUTION

### add more code chunks if you like
mod03 <- lm(y ~ (x1 + I(x1^2)) * (x2 + I(x2^2)), df)
coefplot(mod03)

summary(mod03)
## 
## Call:
## lm(formula = y ~ (x1 + I(x1^2)) * (x2 + I(x2^2)), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68423 -0.41328 -0.00785  0.46938  1.29854 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.665298   0.123052   5.407 5.13e-07 ***
## x1               0.164473   0.092344   1.781   0.0782 .  
## I(x1^2)         -0.160376   0.091124  -1.760   0.0818 .  
## x2              -0.051980   0.094540  -0.550   0.5838    
## I(x2^2)         -0.556463   0.078811  -7.061 3.21e-10 ***
## x1:x2            0.122791   0.115453   1.064   0.2903    
## x1:I(x2^2)      -0.082521   0.073847  -1.117   0.2667    
## I(x1^2):x2       0.005341   0.069932   0.076   0.9393    
## I(x1^2):I(x2^2)  0.020387   0.061572   0.331   0.7413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7024 on 91 degrees of freedom
## Multiple R-squared:  0.5465, Adjusted R-squared:  0.5067 
## F-statistic: 13.71 on 8 and 91 DF,  p-value: 7.252e-13

There are 9 features in the model (including the intercept). The intercept and I(x2^2) are statistically significant at the 0.05 level, while x1 and I(x1^2) are marginal (p ≈ 0.08).

1e)

Let’s now try a more complicated model.

Fit a model which interacts linear, quadratic, cubic, quartic (4th degree) polynomial features from x1 with linear, quadratic, cubic, and quartic (4th degree) polynomial features from x2. Assign the result to the mod04 object.

Visualize the coefficient summaries with the coefplot() function. Are any of the features statistically significant?

SOLUTION

### add more code chunks if you like
mod04 <- lm(y ~ (x1 + I(x1^2) + I(x1^3) + I(x1^4)) * (x2 + I(x2^2) + I(x2^3) + I(x2^4)), df)
coefplot(mod04)

summary(mod04)
## 
## Call:
## lm(formula = y ~ (x1 + I(x1^2) + I(x1^3) + I(x1^4)) * (x2 + I(x2^2) + 
##     I(x2^3) + I(x2^4)), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71734 -0.39009 -0.01053  0.44375  1.23420 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.68793    0.19921   3.453 0.000914 ***
## x1               0.02109    0.28792   0.073 0.941801    
## I(x1^2)         -0.28947    0.43004  -0.673 0.502934    
## I(x1^3)          0.09512    0.14427   0.659 0.511716    
## I(x1^4)          0.07392    0.14924   0.495 0.621834    
## x2              -0.09726    0.26291  -0.370 0.712459    
## I(x2^2)         -0.53249    0.38267  -1.392 0.168183    
## I(x2^3)         -0.03745    0.10763  -0.348 0.728814    
## I(x2^4)         -0.05254    0.10839  -0.485 0.629255    
## x1:x2           -0.04748    0.45411  -0.105 0.917004    
## x1:I(x2^2)       0.22877    0.61420   0.372 0.710596    
## x1:I(x2^3)       0.01137    0.30032   0.038 0.969909    
## x1:I(x2^4)      -0.13716    0.22062  -0.622 0.536044    
## I(x1^2):x2      -0.13254    0.49896  -0.266 0.791256    
## I(x1^2):I(x2^2)  0.16030    0.70932   0.226 0.821821    
## I(x1^2):I(x2^3)  0.31559    0.38694   0.816 0.417309    
## I(x1^2):I(x2^4)  0.02303    0.17621   0.131 0.896354    
## I(x1^3):x2       0.05602    0.22590   0.248 0.804826    
## I(x1^3):I(x2^2) -0.29702    0.37187  -0.799 0.426982    
## I(x1^3):I(x2^3)  0.12392    0.21929   0.565 0.573698    
## I(x1^3):I(x2^4)  0.03879    0.12146   0.319 0.750330    
## I(x1^4):x2       0.05509    0.15499   0.355 0.723251    
## I(x1^4):I(x2^2) -0.12017    0.22652  -0.531 0.597329    
## I(x1^4):I(x2^3) -0.04193    0.12638  -0.332 0.741009    
## I(x1^4):I(x2^4) -0.02831    0.07819  -0.362 0.718334    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7278 on 75 degrees of freedom
## Multiple R-squared:  0.5987, Adjusted R-squared:  0.4703 
## F-statistic: 4.663 on 24 and 75 DF,  p-value: 1.506e-07

Only the intercept is statistically significant at the 0.05 level among the 25 features; every other coefficient has a large p-value.

1f)

Let’s try using spline based features. We will use a high degree-of-freedom natural spline applied to x1 and interact those features with polynomial features derived from x2.

Fit a model which interacts a 12 degree-of-freedom natural spline from x1 with linear and quadratic polynomial features from x2. Assign the result to mod05.

Visualize the coefficient summaries with the coefplot() function. Are any of the features statistically significant?

SOLUTION

### add more code chunks if you like
mod05 <- lm(y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2)), df)
coefplot(mod05)

summary(mod05)
## 
## Call:
## lm(formula = y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2)), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52744 -0.26855  0.02001  0.26426  1.51210 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   -3.25456    3.36494  -0.967    0.337
## splines::ns(x1, 12)1           2.83930    2.98495   0.951    0.345
## splines::ns(x1, 12)2           3.99458    3.68155   1.085    0.282
## splines::ns(x1, 12)3           3.02338    3.30619   0.914    0.364
## splines::ns(x1, 12)4           4.85956    3.48455   1.395    0.168
## splines::ns(x1, 12)5           3.84716    3.40463   1.130    0.263
## splines::ns(x1, 12)6           4.14780    3.50392   1.184    0.241
## splines::ns(x1, 12)7           3.70490    3.42926   1.080    0.284
## splines::ns(x1, 12)8           4.13779    3.46708   1.193    0.237
## splines::ns(x1, 12)9           3.33123    3.49147   0.954    0.344
## splines::ns(x1, 12)10          2.88895    1.90966   1.513    0.135
## splines::ns(x1, 12)11          8.01099    7.36634   1.088    0.281
## splines::ns(x1, 12)12          0.50155    1.56499   0.320    0.750
## x2                            -0.41749    1.04879  -0.398    0.692
## I(x2^2)                        0.98044    2.36146   0.415    0.679
## splines::ns(x1, 12)1:x2        0.11828    1.13850   0.104    0.918
## splines::ns(x1, 12)2:x2        1.05316    1.15711   0.910    0.366
## splines::ns(x1, 12)3:x2       -0.79683    1.36943  -0.582    0.563
## splines::ns(x1, 12)4:x2        0.46210    1.18708   0.389    0.698
## splines::ns(x1, 12)5:x2        0.23777    1.35755   0.175    0.862
## splines::ns(x1, 12)6:x2        0.84594    1.32545   0.638    0.526
## splines::ns(x1, 12)7:x2        0.65015    1.23411   0.527    0.600
## splines::ns(x1, 12)8:x2       -1.74980    1.60727  -1.089    0.281
## splines::ns(x1, 12)9:x2        2.63150    1.75578   1.499    0.139
## splines::ns(x1, 12)10:x2      -1.45540    1.63664  -0.889    0.377
## splines::ns(x1, 12)11:x2       0.70418    2.34487   0.300    0.765
## splines::ns(x1, 12)12:x2       0.95791    1.89944   0.504    0.616
## splines::ns(x1, 12)1:I(x2^2)  -0.94898    2.08251  -0.456    0.650
## splines::ns(x1, 12)2:I(x2^2)  -1.90693    2.50487  -0.761    0.449
## splines::ns(x1, 12)3:I(x2^2)   0.45647    2.41980   0.189    0.851
## splines::ns(x1, 12)4:I(x2^2)  -2.87188    2.44950  -1.172    0.246
## splines::ns(x1, 12)5:I(x2^2)  -0.29542    2.48149  -0.119    0.906
## splines::ns(x1, 12)6:I(x2^2)  -3.04026    2.53568  -1.199    0.235
## splines::ns(x1, 12)7:I(x2^2)  -1.23443    2.51382  -0.491    0.625
## splines::ns(x1, 12)8:I(x2^2)  -2.54573    3.18170  -0.800    0.427
## splines::ns(x1, 12)9:I(x2^2)  -2.03575    2.58726  -0.787    0.434
## splines::ns(x1, 12)10:I(x2^2) -0.08262    1.50920  -0.055    0.957
## splines::ns(x1, 12)11:I(x2^2) -3.67140    5.49205  -0.668    0.506
## splines::ns(x1, 12)12:I(x2^2) -0.65090    1.76530  -0.369    0.714
## 
## Residual standard error: 0.6986 on 61 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.5119 
## F-statistic: 3.733 on 38 and 61 DF,  p-value: 2.347e-06

None of the features are statistically significant at the 0.05 level; the estimates have large standard errors relative to their magnitudes.

1g)

Let’s fit one final model.

Fit a model which interacts a 12 degree-of-freedom natural spline from x1 with linear, quadratic, cubic, and quartic (4th degree) polynomial features from x2. Assign the result to mod06.

Visualize the coefficient summaries with the coefplot() function. Are any of the features statistically significant?

SOLUTION

### add more code chunks if you like
mod06 <- lm(y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2) + I(x2^3) + I(x2^4)), df)
coefplot(mod06)

summary(mod06)
## 
## Call:
## lm(formula = y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2) + I(x2^3) + 
##     I(x2^4)), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.28025 -0.18927  0.00148  0.19677  1.29085 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    -6.4518     7.8129  -0.826   0.4145  
## splines::ns(x1, 12)1            2.6128     7.0565   0.370   0.7134  
## splines::ns(x1, 12)2            9.3081     8.4832   1.097   0.2800  
## splines::ns(x1, 12)3            4.1315     7.4521   0.554   0.5828  
## splines::ns(x1, 12)4            8.9394     8.2127   1.088   0.2838  
## splines::ns(x1, 12)5            6.5833     7.7628   0.848   0.4022  
## splines::ns(x1, 12)6            7.4580     8.0690   0.924   0.3617  
## splines::ns(x1, 12)7            7.1925     7.9314   0.907   0.3707  
## splines::ns(x1, 12)8            7.4106     7.9554   0.932   0.3580  
## splines::ns(x1, 12)9            4.1484     8.0160   0.518   0.6080  
## splines::ns(x1, 12)10           4.8113     4.7286   1.017   0.3159  
## splines::ns(x1, 12)11          16.6143    16.7366   0.993   0.3277  
## splines::ns(x1, 12)12           0.9362     5.7915   0.162   0.8725  
## x2                             -6.5332    23.4513  -0.279   0.7822  
## I(x2^2)                        14.4492    15.5297   0.930   0.3585  
## I(x2^3)                         3.2273    13.9230   0.232   0.8180  
## I(x2^4)                        -6.3553     8.2281  -0.772   0.4451  
## splines::ns(x1, 12)1:x2         6.8936    21.7609   0.317   0.7533  
## splines::ns(x1, 12)2:x2         4.1050    24.2406   0.169   0.8665  
## splines::ns(x1, 12)3:x2         9.6913    23.2146   0.417   0.6789  
## splines::ns(x1, 12)4:x2         5.1349    23.6394   0.217   0.8293  
## splines::ns(x1, 12)5:x2         8.2680    23.5888   0.351   0.7281  
## splines::ns(x1, 12)6:x2         6.0546    23.6615   0.256   0.7995  
## splines::ns(x1, 12)7:x2         6.1047    23.6592   0.258   0.7979  
## splines::ns(x1, 12)8:x2         4.3895    23.8739   0.184   0.8552  
## splines::ns(x1, 12)9:x2         7.7246    24.1189   0.320   0.7507  
## splines::ns(x1, 12)10:x2       -3.0872    13.8332  -0.223   0.8247  
## splines::ns(x1, 12)11:x2       23.1130    50.2685   0.460   0.6485  
## splines::ns(x1, 12)12:x2       20.4546    14.2648   1.434   0.1605  
## splines::ns(x1, 12)1:I(x2^2)   -3.1970    12.5813  -0.254   0.8009  
## splines::ns(x1, 12)2:I(x2^2)  -25.2801    18.8366  -1.342   0.1882  
## splines::ns(x1, 12)3:I(x2^2)   -2.5568    16.3410  -0.156   0.8766  
## splines::ns(x1, 12)4:I(x2^2)  -20.0453    16.2677  -1.232   0.2261  
## splines::ns(x1, 12)5:I(x2^2)  -11.7053    15.6654  -0.747   0.4599  
## splines::ns(x1, 12)6:I(x2^2)  -17.0964    16.1437  -1.059   0.2968  
## splines::ns(x1, 12)7:I(x2^2)  -18.2229    16.1370  -1.129   0.2665  
## splines::ns(x1, 12)8:I(x2^2)  -14.4558    16.9587  -0.852   0.3998  
## splines::ns(x1, 12)9:I(x2^2)    7.3273    18.8787   0.388   0.7003  
## splines::ns(x1, 12)10:I(x2^2) -16.8227     9.9025  -1.699   0.0982 .
## splines::ns(x1, 12)11:I(x2^2) -35.0930    38.3353  -0.915   0.3662  
## splines::ns(x1, 12)12:I(x2^2)   6.7859    23.3920   0.290   0.7735  
## splines::ns(x1, 12)1:I(x2^3)   -5.4974    12.0061  -0.458   0.6499  
## splines::ns(x1, 12)2:I(x2^3)   -1.7861    14.8472  -0.120   0.9049  
## splines::ns(x1, 12)3:I(x2^3)   -6.9984    13.7338  -0.510   0.6135  
## splines::ns(x1, 12)4:I(x2^3)   -1.7398    14.1322  -0.123   0.9027  
## splines::ns(x1, 12)5:I(x2^3)   -4.8691    14.1286  -0.345   0.7324  
## splines::ns(x1, 12)6:I(x2^3)   -2.3285    14.1270  -0.165   0.8700  
## splines::ns(x1, 12)7:I(x2^3)   -2.0034    14.8235  -0.135   0.8933  
## splines::ns(x1, 12)8:I(x2^3)  -10.3383    17.0872  -0.605   0.5491  
## splines::ns(x1, 12)9:I(x2^3)   15.6402    19.0905   0.819   0.4182  
## splines::ns(x1, 12)10:I(x2^3)  -5.9264    10.9143  -0.543   0.5906  
## splines::ns(x1, 12)11:I(x2^3) -47.2825    43.2845  -1.092   0.2821  
## splines::ns(x1, 12)12:I(x2^3) -60.7356    46.7591  -1.299   0.2025  
## splines::ns(x1, 12)1:I(x2^4)    2.8676     6.4499   0.445   0.6593  
## splines::ns(x1, 12)2:I(x2^4)    9.0441     9.1748   0.986   0.3310  
## splines::ns(x1, 12)3:I(x2^4)    3.4147     8.4155   0.406   0.6874  
## splines::ns(x1, 12)4:I(x2^4)    7.2440     8.3448   0.868   0.3913  
## splines::ns(x1, 12)5:I(x2^4)    6.0223     8.2786   0.727   0.4718  
## splines::ns(x1, 12)6:I(x2^4)    5.6567     8.5397   0.662   0.5121  
## splines::ns(x1, 12)7:I(x2^4)   11.7698     9.1591   1.285   0.2072  
## splines::ns(x1, 12)8:I(x2^4)   -6.3990    11.4880  -0.557   0.5811  
## splines::ns(x1, 12)9:I(x2^4)   -5.0341    10.7885  -0.467   0.6437  
## splines::ns(x1, 12)10:I(x2^4)   5.2142     5.6158   0.928   0.3595  
## splines::ns(x1, 12)11:I(x2^4)  36.7168    24.3568   1.507   0.1407  
## splines::ns(x1, 12)12:I(x2^4)  32.9296    20.8713   1.578   0.1236  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7854 on 35 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.3831 
## F-statistic: 1.961 on 64 and 35 DF,  p-value: 0.01638

None of the features are statistically significant at the 0.05 level; only one spline interaction term is marginal (p ≈ 0.098).

1h)

Now that you have fit multiple models of varying complexity, it is time to identify the best performing model.

Identify the best model considering training set only performance metrics. Which model is best according to R-squared? Which model is best according to AIC? Which model is best according to BIC?

HINT: The broom::glance() function can be helpful here. The broom package is installed with tidyverse and so you should have it already.

SOLUTION

### add more code chunks if you like
broom::glance(mod01)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0344        0.0246 0.988      3.49  0.0646     1  -140.  285.  293.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(mod02)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.113        0.0853 0.956      4.08 0.00900     3  -135.  281.  294.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(mod03)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.547         0.507 0.702      13.7 7.25e-13     8  -102.  224.  250.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(mod04)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.599         0.470 0.728      4.66 0.000000151    24  -95.7  243.  311.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(mod05)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.699         0.512 0.699      3.73 0.00000235    38  -81.3  243.  347.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(mod06)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.782         0.383 0.785      1.96  0.0164    64  -65.3  263.  434.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
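A compact alternative, shown here as a sketch that assumes mod01 through mod06 exist in the session: stack the six glance() summaries into one tibble so the metrics can be compared side by side.

```r
# Sketch: collect the training-set metrics for all six models in one table.
# Assumes mod01 through mod06 were fit in Problem 01.
purrr::map2_dfr(list(mod01, mod02, mod03, mod04, mod05, mod06),
                sprintf("mod%02d", 1:6),
                ~ broom::glance(.x) %>% dplyr::mutate(model = .y)) %>% 
  dplyr::select(model, r.squared, AIC, BIC) %>% 
  dplyr::arrange(AIC)
```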

Based on R-squared, mod06 is the best model. Based on AIC, mod03 is the best model, and based on BIC, mod03 is also the best model.

Problem 02

Now that you know which model is best, let’s visualize the predictive trends from the six models. This will help us better understand their performance and behavior.

2a)

You will define a prediction or visualization test grid. This grid will allow you to visualize behavior with respect to x1 for multiple values of x2.

Create a grid of input values where x1 consists of 101 evenly spaced points between -3.2 and 3.2 and x2 is 9 evenly spaced points between -3 and 3. The expand.grid() function is started for you and the data type conversion is provided to force the result to be a tibble.

SOLUTION

viz_grid <- expand.grid(x1 = seq(-3.2, 3.2, length.out = 101),
                        x2 = seq(-3, 3, length.out = 9),
                        KEEP.OUT.ATTRS = FALSE,
                        stringsAsFactors = FALSE) %>% 
  as.data.frame() %>% tibble::as_tibble()
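A quick sanity check, as a sketch: 101 x1 values crossed with 9 x2 values should yield 909 rows in the grid.

```r
# Sketch: verify the visualization grid dimensions.
# Assumes viz_grid was created by the expand.grid() call above.
nrow(viz_grid)                # expect 101 * 9 = 909
length(unique(viz_grid$x2))   # expect 9
```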

2b)

You will make predictions for each of the models and visualize their trends. A function, tidy_predict(), is created for you which assembles the predicted mean trend, the confidence interval, and the prediction interval into a tibble. The result includes the input values to streamline making the visualizations.

tidy_predict <- function(mod, xnew)
{
  pred_df <- predict(mod, xnew, interval = "confidence") %>% 
    as.data.frame() %>% tibble::as_tibble() %>% 
    dplyr::select(pred = fit, ci_lwr = lwr, ci_upr = upr) %>% 
    bind_cols(predict(mod, xnew, interval = 'prediction') %>% 
                as.data.frame() %>% tibble::as_tibble() %>% 
                dplyr::select(pred_lwr = lwr, pred_upr = upr))
  
  xnew %>% bind_cols(pred_df)
}

The first argument to the tidy_predict() function is a lm() model object and the second argument is a new or test dataframe of inputs. When working with lm() and its predict() method, the functions will create the test design matrix consistent with the training design basis. They do so via the formula contained within the lm() model object. The lm() object therefore takes care of the heavy lifting for us!

Make predictions with each of the six models you fit in Problem 01 using the visualization grid, viz_grid. The predictions should be assigned to the variables pred_lm_01 through pred_lm_06 where the number is consistent with the model number fit previously.

SOLUTION

pred_lm_01 <- tidy_predict(mod01, viz_grid)

pred_lm_02 <- tidy_predict(mod02, viz_grid)

pred_lm_03 <- tidy_predict(mod03, viz_grid)

pred_lm_04 <- tidy_predict(mod04, viz_grid)

pred_lm_05 <- tidy_predict(mod05, viz_grid)

pred_lm_06 <- tidy_predict(mod06, viz_grid)

2c)

You will now visualize the predictive trends and the confidence and prediction intervals for each model. The pred column of each pred_lm_ object is the predictive mean trend. The ci_lwr and ci_upr columns are the lower and upper bounds of the confidence interval, respectively. The pred_lwr and pred_upr columns are the lower and upper bounds of the prediction interval, respectively.

You will use ggplot() to visualize the predictions. You will use geom_line() to visualize the mean trend and geom_ribbon() to visualize the uncertainty intervals.

Visualize the predictions of each model on the visualization grid. Pipe the pred_lm_ object to ggplot() and map the x1 variable to the x-aesthetic. Add three geometric object layers. The first and second layers are each geom_ribbon() and the third layer is geom_line(). In the geom_line() layer map the pred variable to the y aesthetic. In the first geom_ribbon() layer, map pred_lwr and pred_upr to the ymin and ymax aesthetics, respectively. Hard code the fill to be orange in the first geom_ribbon() layer (outside the aes() call). In the second geom_ribbon() layer, map ci_lwr and ci_upr to the ymin and ymax aesthetics, respectively. Hard code the fill to be grey in the second geom_ribbon() layer (outside the aes() call). Include facet_wrap() with the facets controlled by the x2 variable.

To help compare the visualizations across models include a coord_cartesian() layer with the ylim argument set to c(-7,7).

Each model’s prediction visualization should be created in a separate code chunk.

SOLUTION

Create separate code chunks for each visualization.

pred_lm_01 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

pred_lm_02 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

pred_lm_03 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

pred_lm_04 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

pred_lm_05 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

pred_lm_06 %>% ggplot(mapping = aes(x = x1)) +
  geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = "orange") +
  geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = "grey") +
  geom_line(mapping = aes(y = pred)) +
  facet_wrap(~x2) +
  coord_cartesian(ylim = c(-7, 7))

2d)

Do you feel the predictions are consistent with the model performance rankings based on AIC/BIC? What is the defining characteristic of the models considered to be the worst by AIC/BIC?

SOLUTION

The predictions are consistent with the model rankings based on AIC/BIC. mod03, which has the lowest AIC and BIC, produces smooth trends with reasonable uncertainty. The models ranked worst by AIC/BIC are the most complex ones: their defining characteristic is wildly varying mean trends and extremely wide confidence and prediction intervals at the edges of the input space, which is the hallmark of overfitting.

Problem 03

Now that you have fit non-Bayesian linear models with maximum likelihood estimation, it is time to use Bayesian models to understand the influence of the prior on the model behavior.

Regardless of your answers in Problem 02 you will only work with model 3 and model 6 in this problem.

3a)

You will perform the Bayesian analysis using the Laplace Approximation just as you did in the previous assignment. Before defining the log-posterior function, you must create the list of required information. This list will include the observed response, the design matrix, and the prior specification. You will use independent Gaussian priors on the regression parameters with a shared prior mean and shared prior standard deviation. You will use an Exponential prior on the unknown likelihood noise (the \(\sigma\) parameter).

Complete the two code chunks below. In the first, create the design matrix following mod03’s formula, and assign the object to the X03 variable. Complete the info_03_weak list by assigning the response to yobs and the design matrix to design_matrix. Specify the shared prior mean, mu_beta, to be 0, the shared prior standard deviation, tau_beta, as 50, and the rate parameter on the noise, sigma_rate, to be 1.

Complete the second code chunk with the same prior specification. The second code chunk however requires that you create the design matrix associated with mod06’s formula and assign the object to the X06 variable. Assign X06 to the design_matrix field of the info_06_weak list.

SOLUTION

X03 <- model.matrix(y ~ (x1 + I(x1^2)) * (x2 + I(x2^2)), data = df)

info_03_weak <- list(
  yobs = df$y,
  design_matrix = X03,
  mu_beta = 0,
  tau_beta = 50,
  sigma_rate = 1
)
X06 <- model.matrix(y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2) + I(x2^3) + I(x2^4)), data = df)

info_06_weak <- list(
  yobs = df$y,
  design_matrix = X06,
  mu_beta = 0,
  tau_beta = 50,
  sigma_rate = 1
)
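Before moving on, it helps to see just how diffuse this prior is. The quick sketch below (illustrative only, not a required part of the solution) computes the middle 95% prior interval for a single coefficient under the shared Gaussian prior with `tau_beta = 50`.

```r
# Middle 95% prior interval for one coefficient under the "weak" prior:
# N(mu_beta = 0, tau_beta = 50), as specified in info_03_weak / info_06_weak.
prior_95 <- qnorm(c(0.025, 0.975), mean = 0, sd = 50)
prior_95
# roughly -98 to +98: effectively non-informative for coefficients of order 1

stopifnot(abs(prior_95[2] - 98) < 0.5)
```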

3b)

You will now define the log-posterior function lm_logpost(). You will continue to use the log-transformation on \(\sigma\), and so you will actually define the log-posterior in terms of the mean trend \(\boldsymbol{\beta}\)-parameters and the unbounded noise parameter, \(\varphi = \log\left[\sigma\right]\).

The comments in the code chunk below tell you what you need to fill in. The unknown parameters to learn are contained within the first input argument, unknowns. You will assume that the unknown \(\boldsymbol{\beta}\)-parameters are listed before the unknown \(\varphi\) parameter in the unknowns vector. You must specify the number of \(\boldsymbol{\beta}\) parameters programmatically to allow scaling up your function to an arbitrary number of unknowns. You will assume that all variables contained in the my_info list (the second argument to lm_logpost()) are the same fields in the info_03_weak list you defined in Problem 3a).

Define the log-posterior function by completing the code chunk below. You must calculate the mean trend, mu, using matrix math between the design matrix and the unknown \(\boldsymbol{\beta}\) column vector.

HINT: This function should look very familiar…

SOLUTION

lm_logpost <- function(unknowns, my_info)
{
  # specify the number of unknown beta parameters
  length_beta <- ncol(my_info$design_matrix) 
  
  # extract the beta parameters from the `unknowns` vector
  beta_v <- unknowns[1:length_beta]
  
  # extract the unbounded noise parameter, varphi
  lik_varphi <- unknowns[length_beta+1] 
  
  # back-transform from varphi to sigma
  lik_sigma <- exp(lik_varphi)
  
  # extract design matrix
  X <- my_info$design_matrix
  
  # calculate the linear predictor
  mu <- as.vector(X %*% as.matrix(beta_v))
  
  # evaluate the log-likelihood
  log_lik <- sum(dnorm(x = my_info$yobs,
                       mean = mu,
                       sd = lik_sigma,
                       log = TRUE))
  
  # evaluate the log-prior
  log_prior_beta <- sum(dnorm(x = beta_v,
                              mean = my_info$mu_beta,
                              sd = my_info$tau_beta,
                              log = TRUE))
  
  log_prior_sigma <- dexp(x = lik_sigma,
                          rate = my_info$sigma_rate,
                          log = TRUE)
  
  # add the mean trend prior and noise prior together
  log_prior <- log_prior_beta + log_prior_sigma
  
  # account for the log-transformation (change-of-variables adjustment)
  log_derive_adjust <- lik_varphi
  
  # sum together
  log_lik + log_prior + log_derive_adjust
}
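The change-of-variables term deserves a quick sanity check. Since \(\sigma = \exp(\varphi)\), the log-derivative adjustment is \(\log\left|\mathrm{d}\sigma / \mathrm{d}\varphi\right| = \varphi\). The sketch below (illustrative only, using an arbitrary test value) confirms that adding \(\varphi\) to the Exponential log-density evaluated at \(\exp(\varphi)\) matches the density of \(\varphi\) obtained by numerically differentiating its CDF.

```r
# Verify the log-Jacobian adjustment used in lm_logpost at a test value.
varphi <- 0.7
sigma <- exp(varphi)

# log-density of varphi implied by sigma ~ Exp(rate = 1):
# log p(varphi) = log p(sigma) + log |d sigma / d varphi| = dexp(...) + varphi
log_dens_varphi <- dexp(sigma, rate = 1, log = TRUE) + varphi

# independent numerical check: differentiate P(varphi <= v) = pexp(exp(v), 1)
h <- 1e-6
num_deriv <- (pexp(exp(varphi + h), 1) - pexp(exp(varphi - h), 1)) / (2 * h)

stopifnot(abs(exp(log_dens_varphi) - num_deriv) < 1e-6)
```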

3c)

The my_laplace() function is defined for you in the code chunk below. This function executes the laplace approximation and returns the object consisting of the posterior mode, posterior covariance matrix, and the log-evidence.

my_laplace <- function(start_guess, logpost_func, ...)
{
  # code adapted from the `LearnBayes`` function `laplace()`
  fit <- optim(start_guess,
               logpost_func,
               gr = NULL,
               ...,
               method = "BFGS",
               hessian = TRUE,
               control = list(fnscale = -1, maxit = 1001))
  
  mode <- fit$par
  post_var_matrix <- -solve(fit$hessian)
  p <- length(mode)
  int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
  # package all of the results into a list
  list(mode = mode,
       var_matrix = post_var_matrix,
       log_evidence = int,
       converge = ifelse(fit$convergence == 0,
                         "YES", 
                         "NO"),
       iter_counts = as.numeric(fit$counts[1]))
}
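Before applying my_laplace() to the regression models, the log-evidence formula can be checked on a toy 1-D case (illustrative only, not part of the required solution). For a target that is exactly Gaussian the Laplace approximation is exact, so running the same mode / Hessian / evidence calculation on an already-normalized standard normal log-density should return a log-evidence of (numerically) zero.

```r
# Laplace approximation applied to a normalized standard normal log-density.
logpost_toy <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)

fit <- optim(0.5, logpost_toy, method = "BFGS", hessian = TRUE,
             control = list(fnscale = -1))

post_var <- -solve(fit$hessian)
p <- length(fit$par)
log_ev <- p / 2 * log(2 * pi) + 0.5 * log(det(post_var)) + logpost_toy(fit$par)

# the target integrates to 1, so the log-evidence should be essentially 0
stopifnot(abs(log_ev) < 1e-3)
```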

Execute the Laplace Approximation for the model 3 formulation and the model 6 formulation. Assign the model 3 result to the laplace_03_weak object, and assign the model 6 result to the laplace_06_weak object. Check that the optimization scheme converged.

SOLUTION

### add more code chunks if you like
laplace_03_weak <- my_laplace(rep(0, ncol(X03) + 1), lm_logpost, info_03_weak)
laplace_03_weak$converge
## [1] "YES"
### add more code chunks if you like
laplace_06_weak <- my_laplace(rep(0, ncol(X06) + 1), lm_logpost, info_06_weak)
laplace_06_weak$converge
## [1] "YES"

3d)

A function is defined for you in the code chunk below. This function creates a coefficient summary plot in the style of the coefplot() function, but uses the Bayesian results from the Laplace Approximation. The first argument is the vector of posterior means, and the second argument is the vector of posterior standard deviations. The third argument is the name of the feature associated with each coefficient.

viz_post_coefs <- function(post_means, post_sds, xnames)
{
  tibble::tibble(
    mu = post_means,
    sd = post_sds,
    x = xnames
  ) %>% 
    mutate(x = factor(x, levels = xnames)) %>% 
    ggplot(mapping = aes(x = x)) +
    geom_hline(yintercept = 0, color = 'grey', linetype = 'dashed') +
    geom_point(mapping = aes(y = mu)) +
    geom_linerange(mapping = aes(ymin = mu - 2 * sd,
                                 ymax = mu + 2 * sd,
                                 group = x)) +
    labs(x = 'feature', y = 'coefficient value') +
    coord_flip() +
    theme_bw()
}

Create the posterior summary visualization figure for model 3 and model 6. You must provide the posterior means and standard deviations of the regression coefficients (the \(\beta\) parameters). Do NOT include the \(\varphi\) parameter. The feature names associated with the coefficients can be extracted from the design matrix using the colnames() function.

SOLUTION

### make the posterior coefficient visualization for model 3
viz_post_coefs(laplace_03_weak$mode[1:ncol(X03)],
               sqrt(diag(laplace_03_weak$var_matrix))[1:ncol(X03)],
               colnames(X03))

### make the posterior coefficient visualization for model 6
viz_post_coefs(laplace_06_weak$mode[1:ncol(X06)],
               sqrt(diag(laplace_06_weak$var_matrix))[1:ncol(X06)],
               colnames(X06))

3e)

Use the Bayes Factor to identify the better of the models.

SOLUTION

### add more code chunks if you like
exp(laplace_06_weak$log_evidence - laplace_03_weak$log_evidence)

The Bayes Factor is the ratio of the two models’ evidences, which corresponds to the difference (not the ratio) of the log-evidences before exponentiating. The expression above is the Bayes Factor of mod06 relative to mod03: a value greater than 1 means the data support mod06 over mod03, while a value less than 1 favors mod03. Equivalently, the model with the larger log-evidence is the better model.

3f)

You fit the Bayesian models assuming a diffuse or weak prior. Let’s now try a more informative or strong prior by reducing the prior standard deviation on the regression coefficients from 50 to 1. The prior mean will still be zero.

Complete the first code chunk below, which defines the list of required information for both the model 3 and model 6 formulations using the strong prior on the regression coefficients. All other information, data and the \(\sigma\) prior, are the same as before.

Run the Laplace Approximation using the strong prior for both the model 3 and model 6 formulations. Assign the results to laplace_03_strong and laplace_06_strong.

Confirm that the optimizations converged for both laplace approximation results.

SOLUTION

Define the lists of required information for the strong prior.

info_03_strong <- list(
  yobs = df$y,
  design_matrix = X03,
  mu_beta = 0,
  tau_beta = 1,
  sigma_rate = 1
)

info_06_strong <- list(
  yobs = df$y,
  design_matrix = X06,
  mu_beta = 0,
  tau_beta = 1,
  sigma_rate = 1
)

Execute the Laplace Approximation.

### add more code chunks if you like
laplace_03_strong <- my_laplace(rep(0, ncol(X03) + 1), lm_logpost, info_03_strong)
laplace_03_strong$converge
## [1] "YES"
### add more code chunks if you like
laplace_06_strong <- my_laplace(rep(0, ncol(X06) + 1), lm_logpost, info_06_strong)
laplace_06_strong$converge
## [1] "YES"

3g)

Use the viz_post_coefs() function to visualize the posterior coefficient summaries for model 3 and model 6, based on the strong prior specification.

SOLUTION

### add more code chunks if you like
viz_post_coefs(laplace_03_strong$mode[1:ncol(X03)],
               sqrt(diag(laplace_03_strong$var_matrix))[1:ncol(X03)],
               colnames(X03))

### add more code chunks if you like
viz_post_coefs(laplace_06_strong$mode[1:ncol(X06)],
               sqrt(diag(laplace_06_strong$var_matrix))[1:ncol(X06)],
               colnames(X06))

3h)

You will fit one more set of Bayesian models with a very strong prior on the regression coefficients. The prior standard deviation will be equal to 1/50.

Complete the first code chunk below, which defines the list of required information for both the model 3 and model 6 formulations using the very strong prior on the regression coefficients. All other information, data and the \(\sigma\) prior, are the same as before.

Run the Laplace Approximation using the strong prior for both the model 3 and model 6 formulations. Assign the results to laplace_03_very_strong and laplace_06_very_strong.

Confirm that the optimizations converged for both laplace approximation results.

SOLUTION

info_03_very_strong <- list(
  yobs = df$y,
  design_matrix = X03,
  mu_beta = 0,
  tau_beta = 1/50,
  sigma_rate = 1
)

info_06_very_strong <- list(
  yobs = df$y,
  design_matrix = X06,
  mu_beta = 0,
  tau_beta = 1/50,
  sigma_rate = 1
)

Execute the Laplace Approximation.

### add more code chunks if you like
laplace_03_very_strong <- my_laplace(rep(0, ncol(X03) + 1), lm_logpost, info_03_very_strong)
laplace_03_very_strong$converge
## [1] "YES"
### add more code chunks if you like
laplace_06_very_strong <- my_laplace(rep(0, ncol(X06) + 1), lm_logpost, info_06_very_strong)
laplace_06_very_strong$converge
## [1] "YES"

3i)

Use the viz_post_coefs() function to visualize the posterior coefficient summaries for model 3 and model 6, based on the very strong prior specification.

SOLUTION

### add more code chunks if you like
viz_post_coefs(laplace_03_very_strong$mode[1:ncol(X03)],
               sqrt(diag(laplace_03_very_strong$var_matrix))[1:ncol(X03)],
               colnames(X03))

### add more code chunks if you like
viz_post_coefs(laplace_06_very_strong$mode[1:ncol(X06)],
               sqrt(diag(laplace_06_very_strong$var_matrix))[1:ncol(X06)],
               colnames(X06))

3j)

Describe the influence of the regression coefficient prior standard deviation on the coefficient posterior distributions.

SOLUTION

As the prior standard deviation on the regression coefficients decreases, the coefficient posteriors are pulled toward the shared prior mean of zero and their uncertainty intervals narrow. The small prior standard deviation constrains the posterior away from extreme coefficient values. This is clearest in the coefficient plot for mod06 with the very strong prior, where most of the coefficients are shrunk very close to zero.

3k)

You previously compared the two models using the Bayes Factor based on the weak prior specification.

Compare the performance of the two models with Bayes Factors again, but considering the results based on the strong and very strong priors. Does the prior influence which model is considered to be better?

SOLUTION

### add more code chunks if you like
exp(laplace_06_strong$log_evidence - laplace_03_strong$log_evidence)

### add more code chunks if you like
exp(laplace_06_very_strong$log_evidence - laplace_03_very_strong$log_evidence)

Each expression is the Bayes Factor of mod06 relative to mod03 under a given prior strength, computed as the exponentiated difference of the log-evidences. Comparing the two values shows whether the prior influences which model is considered better: if one Bayes Factor is above 1 and the other below 1, changing the prior strength flips the preferred model; if both fall on the same side of 1, the ranking is robust to the prior.

Problem 04

You examined the behavior of the coefficient posterior based on the influence of the prior. Let’s now consider the prior’s influence by examining the posterior predictive distributions.

4a)

You will make posterior predictions following the approach from the previous assignment. Posterior samples are generated and those samples are used to calculate the posterior samples of the mean trend and generate random posterior samples of the response around the mean. In the previous assignment, you made posterior predictions in order to calculate errors. In this assignment, you will not calculate errors, instead you will summarize the posterior predictions of the mean and of the random response.

The generate_lm_post_samples() function is defined for you below. It uses the MASS::mvrnorm() function to generate posterior samples from the Laplace Approximation’s MVN distribution.

generate_lm_post_samples <- function(mvn_result, length_beta, num_samples)
{
  MASS::mvrnorm(n = num_samples,
                mu = mvn_result$mode,
                Sigma = mvn_result$var_matrix) %>% 
    as.data.frame() %>% tibble::as_tibble() %>% 
    purrr::set_names(c(sprintf("beta_%02d", 0:(length_beta-1)), "varphi")) %>% 
    mutate(sigma = exp(varphi))
}
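A quick illustration (toy posterior with hypothetical numbers, not the assignment’s model) of the shape of what MASS::mvrnorm() returns: one row per posterior sample and one column per parameter, with sample means sitting close to the supplied mode.

```r
library(MASS)

set.seed(2023)  # hypothetical seed, for reproducibility only
toy_mode <- c(1, -2)
toy_cov <- diag(c(0.1, 0.2))

draws <- MASS::mvrnorm(n = 5000, mu = toy_mode, Sigma = toy_cov)

# one row per posterior sample, one column per parameter
stopifnot(dim(draws) == c(5000, 2))
# the sample means should sit close to the posterior mode
stopifnot(max(abs(colMeans(draws) - toy_mode)) < 0.05)
```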

The code chunk below starts the post_lm_pred_samples() function. This function generates posterior mean trend predictions and posterior predictions of the response. The first argument, Xnew, is a potentially new or test design matrix that we wish to make predictions at. The second argument, Bmat, is a matrix of posterior samples of the \(\boldsymbol{\beta}\)-parameters, and the third argument, sigma_vector, is a vector of posterior samples of the likelihood noise. The Xnew matrix has rows equal to the number of predictions points, M, and the Bmat matrix has rows equal to the number of posterior samples S.

You must complete the function by performing the necessary matrix math to calculate the matrix of posterior mean trend predictions, Umat, and the matrix of posterior response predictions, Ymat. You must also complete missing arguments to the definition of the Rmat and Zmat matrices. The Rmat matrix replicates the posterior likelihood noise samples the correct number of times. The Zmat matrix is the matrix of randomly generated standard normal values. You must correctly specify the required number of rows to the Rmat and Zmat matrices.

The post_lm_pred_samples() returns the Umat and Ymat matrices contained within a list.

Perform the necessary matrix math to calculate the matrix of posterior predicted mean trends Umat and posterior predicted responses Ymat. You must specify the number of required rows to create the Rmat and Zmat matrices.

HINT: The following code chunk should look familiar…

SOLUTION

post_lm_pred_samples <- function(Xnew, Bmat, sigma_vector)
{
  # number of new prediction locations
  M <- nrow(Xnew)
  # number of posterior samples
  S <- nrow(Bmat)
  
  # matrix of linear predictors
  Umat <- Xnew%*% t(Bmat)
  
  # assemble matrix of sigma samples, set the number of rows
  Rmat <- matrix(rep(sigma_vector, M), nrow = M, byrow = TRUE)
  
  # generate standard normal draws and assemble into matrix
  # set the number of rows
  Zmat <- matrix(rnorm(M * S), nrow = M, byrow = TRUE)
  
  # calculate the random observation predictions
  Ymat <- Umat + Rmat*Zmat
  
  # package together
  list(Umat = Umat, Ymat = Ymat)
}
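The replication pattern for Rmat is easy to get backwards, so here is a toy shape check (illustrative numbers only): Rmat must be M x S, with every row equal to the full vector of S posterior sigma samples, so that element (m, s) pairs prediction location m with posterior sample s.

```r
M <- 3                                 # toy number of prediction locations
S <- 4                                 # toy number of posterior samples
sigma_vector <- c(0.5, 1.0, 1.5, 2.0)  # toy posterior sigma samples

Rmat <- matrix(rep(sigma_vector, M), nrow = M, byrow = TRUE)

stopifnot(dim(Rmat) == c(M, S))
stopifnot(all(Rmat[1, ] == sigma_vector), all(Rmat[M, ] == sigma_vector))
```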

4b)

Since this assignment is focused on visualizing the predictions, we will summarize the posterior predictions to focus on the posterior means and the middle 95% uncertainty intervals. The code chunk below is defined for you which serves as a useful wrapper function to call post_lm_pred_samples().

make_post_lm_pred <- function(Xnew, post)
{
  Bmat <- post %>% select(starts_with("beta_")) %>% as.matrix()
  
  sigma_vector <- post %>% pull(sigma)
  
  post_lm_pred_samples(Xnew, Bmat, sigma_vector)
}

The code chunk below defines a function summarize_lm_pred_from_laplace() which manages the actions necessary to summarize posterior predictions. The first argument, mvn_result, is the Laplace Approximation object. The second object is the test design matrix, Xtest, and the third argument, num_samples, is the number of posterior samples to make.

You must complete the code chunk below which summarizes the posterior predictions. This function takes care of most of the coding for you. You do not have to worry about the generation of the posterior samples OR calculating the posterior quantiles associated with the middle 95% uncertainty interval. You must calculate the posterior average by deciding on whether you should use colMeans() or rowMeans() to calculate the average across all posterior samples per prediction location.

Follow the comments in the code chunk below to complete the definition of the summarize_lm_pred_from_laplace() function. You must calculate the average posterior mean trend and the average posterior response.

SOLUTION

summarize_lm_pred_from_laplace <- function(mvn_result, Xtest, num_samples)
{
  # generate posterior samples of the beta parameters
  post <- generate_lm_post_samples(mvn_result, ncol(Xtest), num_samples)
  
  # make posterior predictions on the test set
  pred_test <- make_post_lm_pred(Xtest, post)
  
  # calculate summary statistics on the predicted mean and response
  # summarize over the posterior samples
  
  # posterior mean, should you summarize along rows (rowMeans) or 
  # summarize down columns (colMeans) ???
  mu_avg <- rowMeans(pred_test$Umat)
  y_avg <- rowMeans(pred_test$Ymat)
  
  # posterior quantiles for the middle 95% uncertainty intervals
  mu_lwr <- apply(pred_test$Umat, 1, stats::quantile, probs = 0.025)
  mu_upr <- apply(pred_test$Umat, 1, stats::quantile, probs = 0.975)
  y_lwr <- apply(pred_test$Ymat, 1, stats::quantile, probs = 0.025)
  y_upr <- apply(pred_test$Ymat, 1, stats::quantile, probs = 0.975)
  
  # book keeping
  tibble::tibble(
    mu_avg = mu_avg,
    mu_lwr = mu_lwr,
    mu_upr = mu_upr,
    y_avg = y_avg,
    y_lwr = y_lwr,
    y_upr = y_upr
  ) %>% 
    tibble::rowid_to_column("pred_id")
}
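The rowMeans() versus colMeans() decision comes down to the orientation of Umat: it has M rows (prediction locations) and S columns (posterior samples), so summarizing over posterior samples means averaging across each row. A toy check (illustrative dimensions only):

```r
M <- 2  # toy prediction locations
S <- 5  # toy posterior samples
Umat_toy <- matrix(seq_len(M * S), nrow = M)

mu_avg_toy <- rowMeans(Umat_toy)

# rowMeans yields one summary per prediction location, as required
stopifnot(length(mu_avg_toy) == M)
# colMeans would instead give one value per posterior sample (wrong shape)
stopifnot(length(colMeans(Umat_toy)) == S)
```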

4c)

When you made predictions in Problem 02, the lm() object handled making the test design matrix. However, since we have programmed the Bayesian modeling approach from scratch we need to create the test design matrix manually.

Create the test design matrix based on the visualization grid, viz_grid, using the model 3 formulation. Assign the result to the X03_test object.

Call the summarize_lm_pred_from_laplace() function to summarize the posterior predictions from the model 3 formulation for the weak, strong, and very strong prior specifications. Use 5000 posterior samples for each case. Assign the results from the weak prior to post_pred_summary_viz_03_weak, the results from the strong prior to post_pred_summary_viz_03_strong, and the results from the very strong prior to post_pred_summary_viz_03_very_strong.

SOLUTION

### add as many code chunks as you'd like
X03_test <- model.matrix(~ (x1 + I(x1^2)) * (x2 + I(x2^2)), data = viz_grid)
post_pred_summary_viz_03_weak <- summarize_lm_pred_from_laplace(laplace_03_weak, X03_test, 5000)
post_pred_summary_viz_03_strong <- summarize_lm_pred_from_laplace(laplace_03_strong, X03_test, 5000)
post_pred_summary_viz_03_very_strong <- summarize_lm_pred_from_laplace(laplace_03_very_strong, X03_test, 5000)

4d)

You will now visualize the posterior predictions from the model 3 Bayesian models associated with the weak, strong, and very strong priors. The viz_grid object is joined to the prediction dataframes assuming you have used the correct variable names!

Visualize the predicted means, confidence intervals, and prediction intervals in the style of those that you created in Problem 02. The confidence interval bounds are mu_lwr and mu_upr columns and the prediction interval bounds are the y_lwr and y_upr columns, respectively. The posterior predicted mean of the mean is mu_avg.

Pipe the result of the joined dataframe into ggplot() and make appropriate aesthetics and layers to visualize the predictions with the x1 variable mapped to the x aesthetic and the x2 variable used as a facet variable.

SOLUTION

post_pred_summary_viz_03_weak %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id') %>% 
  ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

post_pred_summary_viz_03_strong %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id') %>% ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

post_pred_summary_viz_03_very_strong %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id') %>% 
  ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

4e)

In order to make posterior predictions for the model 6 formulation you must create a test design matrix consistent with the training set basis. The code chunk below creates a helper function which extracts the knots of a natural spline associated with the training set for you. The first argument, J, is the degrees-of-freedom of the spline, the second argument, train_data, is the training data set. The third argument xname is the name of the variable you are applying the spline to. The xname argument must be provided as a character string.

make_splines_training_knots <- function(J, train_data, xname)
{
  x <- train_data %>% select(all_of(xname)) %>% pull()
  
  train_basis <- splines::ns(x, df = J)
  
  as.vector(attributes(train_basis)$knots)
}
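A toy check (illustrative data only) that extracting the training knots lets you rebuild a consistent basis: calling ns() with the extracted interior knots (and, to be safe, the training boundary knots) reproduces the df-specified training basis exactly. Note that if the prediction grid extends beyond the training range of x1, passing Boundary.knots from the training basis as well keeps the extrapolation behavior consistent; the sketch below extracts both.

```r
library(splines)

x_train <- seq(-2, 2, length.out = 50)  # toy training inputs
B_train <- ns(x_train, df = 4)

interior_knots <- as.vector(attr(B_train, "knots"))
boundary_knots <- as.vector(attr(B_train, "Boundary.knots"))

# rebuild the basis from the extracted knots, as done for the test design matrix
B_rebuilt <- ns(x_train, knots = interior_knots, Boundary.knots = boundary_knots)

stopifnot(max(abs(B_train - B_rebuilt)) < 1e-12)
```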

Create the test design matrix based on the visualization grid, viz_grid, using the model 6 formulation. Assign the result to the X06_test object. Use the make_splines_training_knots() to get the necessary knots associated with the training set for the x1 variable to create the test design matrix.

Call the summarize_lm_pred_from_laplace() function to summarize the posterior predictions from the model 6 formulation for the weak, strong, and very strong prior specifications. Use 5000 posterior samples for each case. Assign the results from the weak prior to post_pred_summary_viz_06_weak, the results from the strong prior to post_pred_summary_viz_06_strong, and the results from the very strong prior to post_pred_summary_viz_06_very_strong.

SOLUTION

### add as many code chunks as you'd like
X06_test <- model.matrix(~ splines::ns(x1, knots = make_splines_training_knots(12, df, 'x1')) * 
                           (x2 + I(x2^2) + I(x2^3) + I(x2^4)), 
                         data = viz_grid)
post_pred_summary_viz_06_weak <- summarize_lm_pred_from_laplace(laplace_06_weak, X06_test, 5000)
post_pred_summary_viz_06_strong <- summarize_lm_pred_from_laplace(laplace_06_strong, X06_test, 5000)
post_pred_summary_viz_06_very_strong <- summarize_lm_pred_from_laplace(laplace_06_very_strong, X06_test, 5000)

4f)

You will now visualize the posterior predictions from the model 6 Bayesian models associated with the weak, strong, and very strong priors. The viz_grid object is joined to the prediction dataframes assuming you have used the correct variable names!

Visualize the predicted means, confidence intervals, and prediction intervals in the style of those that you created in Problem 02. The confidence interval bounds are mu_lwr and mu_upr columns and the prediction interval bounds are the y_lwr and y_upr columns, respectively. The posterior predicted mean of the mean is mu_avg.

Pipe the result of the joined dataframe into ggplot() and make appropriate aesthetics and layers to visualize the predictions with the x1 variable mapped to the x aesthetic and the x2 variable used as a facet variable.

SOLUTION

post_pred_summary_viz_06_weak %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id')%>% 
  ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

post_pred_summary_viz_06_strong %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id')%>% 
  ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

post_pred_summary_viz_06_very_strong %>% 
  left_join(viz_grid %>% tibble::rowid_to_column("pred_id"),
            by = 'pred_id')%>% 
  ggplot(mapping = aes(x = x1)) + 
  geom_ribbon(mapping = aes(ymin = y_lwr, ymax = y_upr), fill = "orange") + 
  geom_ribbon(mapping = aes(ymin = mu_lwr, ymax = mu_upr), fill = "grey") + 
  geom_line(mapping = aes(y = mu_avg)) + 
  facet_wrap(~ x2) + 
  coord_cartesian(ylim = c(-7, 7))

4g)

Describe the behavior of the predictions as the prior standard deviation decreased. Are the posterior predictions consistent with the behavior of the posterior coefficients?

SOLUTION

As the prior standard deviation on the regression coefficients decreases, the posterior predictive mean trends flatten out: the predictions follow the data less and are instead dominated by the prior’s pull toward zero, and the uncertainty in the mean trend shrinks as well. This is consistent with the behavior of the posterior coefficients, which were shrunk toward zero with narrower intervals under the stronger priors.

Problem 05

Now that you have worked with Bayesian models with the prior regularizing the coefficients, you will consider non-Bayesian regularization methods. You will work with the glmnet package in this problem which takes care of all fitting and visualization for you.

The code chunk below loads in glmnet and so you must have glmnet installed before running this code chunk. IMPORANT: the eval flag is set to FALSE below. Once you download glmnet set eval=TRUE.

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-3

5a)

glmnet does not work with the formula interface, so you must create the training design matrix yourself. glmnet also prefers that the intercept column of ones not be included in the design matrix. To support that you must define new design matrices. These matrices use the same formulations as before, but with the intercept column removed. This is easy to do with the formula interface and the model.matrix() function: include - 1 in the formula and model.matrix() will not include the intercept. The code chunk below demonstrates removing the intercept column for a model with linear additive features.

model.matrix( y ~ x1 + x2 - 1, data = df) %>% head()
##           x1         x2
## 1 -0.3092328  0.3087799
## 2  0.6312721 -0.5479198
## 3 -0.6827669  2.1664494
## 4  0.2693056  1.2097037
## 5  0.3725202  0.7854860
## 6  1.2966439 -0.1877231

Create the design matrices for glmnet for the model 3 and model 6 formulations. Remove the intercept column for both and assign the results to X03_glmnet and X06_glmnet.

SOLUTION

### add more code chunks if you prefer
X03_glmnet <- model.matrix(y ~ (x1 + I(x1^2)) * (x2 + I(x2^2)) - 1, data = df)
X06_glmnet <- model.matrix(y ~ (splines::ns(x1, 12)) * (x2 + I(x2^2) + I(x2^3) + I(x2^4)) - 1, data = df)

5b)

By default glmnet uses the lasso penalty. Fit a Lasso model by calling glmnet(). The first argument to glmnet() is the design matrix and the second argument is a regular vector for the response.

Train a Lasso model for the model 3 and model 6 formulations, assign the results to lasso_03 and lasso_06, respectively.

SOLUTION

### add more code chunks if you like
lasso_03 <- glmnet(X03_glmnet, df$y)
lasso_06 <- glmnet(X06_glmnet, df$y)

5c)

Plot the coefficient path for each Lasso model by calling the plot() function on the glmnet model object. Specify the xvar argument to be 'lambda' in the plot() call.

SOLUTION

### add more code chunks if you like
plot(lasso_03, xvar = "lambda")

### add more code chunks if you like
plot(lasso_06, xvar = "lambda")

5d)

Now that you have visualized the coefficient path, it’s time to identify the best 'lambda' value to use! The cv.glmnet() function will by default use 10-fold cross-validation to tune 'lambda'. The first argument to cv.glmnet() is the design matrix and the second argument is the regular vector for the response.

Tune the Lasso regularization strength with cross-validation using the cv.glmnet() function for each model formulation. Assign the model 3 result to lasso_03_cv_tune and assign the model 6 result to lasso_06_cv_tune. Also specify the alpha argument to be 1 to make sure the Lasso penalty is applied in the cv.glmnet() call.

SOLUTION

### add more code chunks if you like
lasso_03_cv_tune <- cv.glmnet(X03_glmnet, df$y, alpha = 1)
lasso_06_cv_tune <- cv.glmnet(X06_glmnet, df$y, alpha = 1)

5e)

Plot the cross-validation results using the default plot method for each cross-validation result. How many coefficients are remaining after tuning?

SOLUTION

### add more code chunks if you like
plot(lasso_03_cv_tune)

### add more code chunks if you like
plot(lasso_06_cv_tune)

In both models, only a single coefficient (besides the intercept) remains after tuning.

5f)

Which features have NOT been “turned off” by the Lasso penalty? Use the coef() function to display the lasso model cross-validation results to show the tuned penalized regression coefficients for each model.
Are the final tuned models different from each other?

SOLUTION

### add more code chunks if you like
coef(lasso_03_cv_tune)
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)      0.3184159
## x1               .        
## I(x1^2)          .        
## x2               .        
## I(x2^2)         -0.3216322
## x1:x2            .        
## x1:I(x2^2)       .        
## I(x1^2):x2       .        
## I(x1^2):I(x2^2)  .
### add more code chunks if you like
coef(lasso_06_cv_tune)
## 65 x 1 sparse Matrix of class "dgCMatrix"
##                                       s1
## (Intercept)                    0.2818858
## splines::ns(x1, 12)1           .        
## splines::ns(x1, 12)2           .        
## splines::ns(x1, 12)3           .        
## splines::ns(x1, 12)4           .        
## splines::ns(x1, 12)5           .        
## splines::ns(x1, 12)6           .        
## splines::ns(x1, 12)7           .        
## splines::ns(x1, 12)8           .        
## splines::ns(x1, 12)9           .        
## splines::ns(x1, 12)10          .        
## splines::ns(x1, 12)11          .        
## splines::ns(x1, 12)12          .        
## x2                             .        
## I(x2^2)                       -0.2847331
## I(x2^3)                        .        
## I(x2^4)                        .        
## splines::ns(x1, 12)1:x2        .        
## splines::ns(x1, 12)2:x2        .        
## splines::ns(x1, 12)3:x2        .        
## splines::ns(x1, 12)4:x2        .        
## splines::ns(x1, 12)5:x2        .        
## splines::ns(x1, 12)6:x2        .        
## splines::ns(x1, 12)7:x2        .        
## splines::ns(x1, 12)8:x2        .        
## splines::ns(x1, 12)9:x2        .        
## splines::ns(x1, 12)10:x2       .        
## splines::ns(x1, 12)11:x2       .        
## splines::ns(x1, 12)12:x2       .        
## splines::ns(x1, 12)1:I(x2^2)   .        
## splines::ns(x1, 12)2:I(x2^2)   .        
## splines::ns(x1, 12)3:I(x2^2)   .        
## splines::ns(x1, 12)4:I(x2^2)   .        
## splines::ns(x1, 12)5:I(x2^2)   .        
## splines::ns(x1, 12)6:I(x2^2)   .        
## splines::ns(x1, 12)7:I(x2^2)   .        
## splines::ns(x1, 12)8:I(x2^2)   .        
## splines::ns(x1, 12)9:I(x2^2)   .        
## splines::ns(x1, 12)10:I(x2^2)  .        
## splines::ns(x1, 12)11:I(x2^2)  .        
## splines::ns(x1, 12)12:I(x2^2)  .        
## splines::ns(x1, 12)1:I(x2^3)   .        
## splines::ns(x1, 12)2:I(x2^3)   .        
## splines::ns(x1, 12)3:I(x2^3)   .        
## splines::ns(x1, 12)4:I(x2^3)   .        
## splines::ns(x1, 12)5:I(x2^3)   .        
## splines::ns(x1, 12)6:I(x2^3)   .        
## splines::ns(x1, 12)7:I(x2^3)   .        
## splines::ns(x1, 12)8:I(x2^3)   .        
## splines::ns(x1, 12)9:I(x2^3)   .        
## splines::ns(x1, 12)10:I(x2^3)  .        
## splines::ns(x1, 12)11:I(x2^3)  .        
## splines::ns(x1, 12)12:I(x2^3)  .        
## splines::ns(x1, 12)1:I(x2^4)   .        
## splines::ns(x1, 12)2:I(x2^4)   .        
## splines::ns(x1, 12)3:I(x2^4)   .        
## splines::ns(x1, 12)4:I(x2^4)   .        
## splines::ns(x1, 12)5:I(x2^4)   .        
## splines::ns(x1, 12)6:I(x2^4)   .        
## splines::ns(x1, 12)7:I(x2^4)   .        
## splines::ns(x1, 12)8:I(x2^4)   .        
## splines::ns(x1, 12)9:I(x2^4)   .        
## splines::ns(x1, 12)10:I(x2^4)  .        
## splines::ns(x1, 12)11:I(x2^4)  .        
## splines::ns(x1, 12)12:I(x2^4)  .

From the final results we can see that in both models only the intercept and the quadratic feature, I(x2^2), have NOT been turned off by the Lasso penalty; every other feature has been driven to zero. The final tuned models are therefore essentially the same.